Abstract

We propose a stochastic model of web user behavior in online social systems and study how the attraction
kernel influences the statistical properties of user and item occurrence. Combining different growth patterns of
new entities with different attraction patterns of old ones yields the various heavy-tailed distributions of popularity
and activity that have been observed in real systems. From a broader perspective, we explore the principle governing
the statistical features of individual popularity and activity in online social systems and point out a potentially
simple mechanism underlying the complex dynamics of these systems.
1. Introduction

The WWW is currently undergoing a landmark shift from the traditional Web 1.0 to Web 2.0, which is characterized
by social collaborative technologies such as social networking sites, blogs, wikis and folksonomies. The social Web (or,
more specifically, online social systems), a label that covers both social networking sites (such as MySpace and
Facebook) and social media sites (such as Digg, CiteULike and Flickr), is changing the way content is created and
distributed. Web-based authoring tools enable users to publish content rapidly, from stories and opinion pieces on
weblogs, to photographs and videos on Flickr and YouTube, to advice on Yahoo! Answers, and to web discoveries on
Del.icio.us and Furl. The availability of large-scale electronic databases has delivered extraordinary new insights
into human behavior and human dynamics on the web. Clear patterns and regularities in the distributions of
individual popularity and activity have been revealed in some online social systems [1-8].

Web users evidently vary widely in their activity levels. Take Digg as an example: some users casually browse the
front page, voting on one or two stories, while others spend hours a day combing the web for new stories to submit and
voting on stories they find on Digg. Items on the web likewise vary widely in their popularity. Some stories
attract a great deal of attention and remain influential for a long time, while most stories attract very little
attention and their impact vanishes rapidly. In social media sites a tag-cloud is usually used to visualize the popularity
of items or, more specifically, of tags. Typical tag-clouds have between 30 and 150 tags. Popularity is represented
by font sizes, colors or other visual cues. Fig. 1 shows a tag-cloud with terms related to Web 2.0.

Recently much attention has been devoted to investigating the statistical features of individual popularity and
activity in online social systems. Their distributions follow either the widely assumed power law or more general
heavy-tailed forms intermediate between exponential and power law, such as the stretched exponential or log-normal
[2, 5, 9]. Despite this progress, little work has been done on the underlying mechanism governing the statistical
features of popularity and activity in online social systems, which is the subject of this work.

We start our analysis from a time-ordered table of item assignments. For the system as a whole, we define
an intrinsic time $T$ as the index of an item assignment in such a table, so that $T$ runs from 1 to the total number
of item assignments. The temporal process shown in Fig. 2 can be regarded as the process of appearance of entities
$u_i$ or $m_i$. The frequency of occurrence of a user/item in the total $T$ events is defined as its activity/popularity.

Figure 1: A tag-cloud with terms related to Web 2.0 (http://en.wikipedia.org/wiki/Tag-Cloud). Generally the font size
of each tag is proportional to the logarithm of its frequency of appearance within the folksonomy.

Figure 2: A schematic illustration of online social systems. It is a sequence of chronologically ordered users and items.



Thus activity measures how frequently a user performs a specific action, such as listening to music, watching films,
browsing posts or sending friendship invitations to other users on the web, and popularity measures how frequently
an item (such as a piece of music, a film, a post or a tag) is visited by web users. Note that for items we can only
measure popularity, while for users we can, in some cases, measure both activity and popularity. For instance,
in online social networks users can invite other users to be their friends. We can then measure the activity of users
in terms of the number of invitations they send, and their popularity in terms of the number of invitations they
receive.
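
As a minimal sketch of these two definitions, the following Python snippet counts activity and popularity from a time-ordered list of events. The function name `activity_and_popularity` and the toy invitation log are hypothetical and serve only as an illustration; they are not taken from the data sets analysed in this paper.

```python
from collections import Counter


def activity_and_popularity(events):
    """Count occurrences in a time-ordered list of (actor, target) pairs.

    activity[actor] is how often the actor performs the action;
    popularity[target] is how often the target is visited or addressed.
    """
    activity = Counter()
    popularity = Counter()
    for actor, target in events:
        activity[actor] += 1
        popularity[target] += 1
    return activity, popularity


# Hypothetical invitation log: (sender, receiver), ordered by intrinsic time T.
invitations = [("u1", "u2"), ("u1", "u3"), ("u2", "u3"), ("u4", "u3")]
activity, popularity = activity_and_popularity(invitations)
print(activity["u1"], popularity["u3"])  # prints: 2 3
```

The same function applies to a listening log if each event is a (user, music) pair, in which case only the popularity of the music and the activity of the users are meaningful.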

It is natural that the number of distinct items or users increases with $T$; however, different growth patterns can
appear. Generally $N(T) \propto T^{\gamma}$, where $\gamma < 1$ implies sub-linear growth while $\gamma = 1$ implies
linear growth, i.e. the generation probability of new individuals is a constant value (homogeneous Poisson process).
In addition, the more frequently an individual has appeared, the more likely it is to appear again. Specifically, when
an old individual joins the sequence, the probability that it is a specific old individual $i$ with previous frequency
of appearance $k_i$ is $\Pi(k_i) \propto k_i^{\beta}$ $(0 \le \beta \le 1)$. The preference metric $0 < \beta < 1$
implies sub-linear preference while $\beta = 1$ implies linear preference. The case where $N(T) \propto T$ and
$\beta = 1$ corresponds to the classic Simon model [10, 11].

When an old individual joins the sequence, the probability that individual $i$ with frequency $k_i$ is selected can be
expressed as $\Pi(k_i) = k_i^{\beta} / \sum_j k_j^{\beta}$. Thus we can compute the probability $\Pi(k)$ that an old
individual of frequency $k$ is chosen, normalized by the number of individuals of frequency $k$ that exist just before
this step [12, 13]:

$$\Pi(k) = \frac{\sum_t [e_t = v \wedge k_v(t-1) = k]}{\sum_t |\{u : k_u(t-1) = k\}|} \sim k^{\beta},$$

where $e_t = v \wedge k_v(t-1) = k$ represents that at time $t$ the old individual $v$ whose frequency is $k$ at time
$t-1$ is chosen. We use $[\cdot]$ to denote a predicate (it takes the value 1 if the expression is true, else 0).
Generally $\Pi(k)$ has significant fluctuations, particularly for large $k$. To reduce the noise level, instead
of $\Pi(k)$ we can study the cumulative function $\kappa(k) = \int_0^k \Pi(k')\,dk' \sim k^{\beta+1}$ to obtain the
preference metric $\beta$.
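
The measurement described above can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `preference_kernel` is hypothetical, the denominator is accumulated only over steps at which an old individual is chosen, and the cumulative function is computed as a discrete sum over $k$ rather than an integral.

```python
from collections import Counter, defaultdict


def preference_kernel(sequence):
    """Estimate Pi(k) and its cumulative sum kappa(k) from a time-ordered sequence of IDs."""
    freq = Counter()             # k_u(t-1) for every individual seen so far
    n_with_freq = Counter()      # number of individuals currently at frequency k
    chosen = defaultdict(int)    # numerator: selections of an individual of frequency k
    exposure = defaultdict(int)  # denominator: individuals of frequency k available
    for entity in sequence:
        k = freq[entity]
        if k > 0:  # an old individual is selected at this step (assumption: only these steps count)
            for kk, n in n_with_freq.items():
                exposure[kk] += n
            chosen[k] += 1
            n_with_freq[k] -= 1
            if n_with_freq[k] == 0:
                del n_with_freq[k]
        freq[entity] = k + 1
        n_with_freq[k + 1] += 1
    pi = {k: chosen[k] / exposure[k] for k in sorted(exposure)}
    kappa, running = {}, 0.0
    for k in sorted(pi):
        running += pi[k]
        kappa[k] = running
    return pi, kappa


pi, kappa = preference_kernel(["a", "b", "a", "a", "b"])
print(pi)     # {1: 0.5, 2: 1.0, 3: 0.0}
print(kappa)  # {1: 0.5, 2: 1.5, 3: 1.5}
```

The exponent $\beta$ can then be read off from the slope of $\ln \kappa(k)$ against $\ln k$, which should be close to $\beta + 1$ if the preferential selection hypothesis holds.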



2. Model 



Consider users listening to music. At each discrete time step $T$, a new user appears with probability $\alpha$,
whereas with probability $1 - \alpha$ an existing user appears again. We can apply the mean-field method to obtain
analytically the probability distribution of individual popularity and activity. When $0 \le \beta < 1$, we have

$$\frac{\partial k_i}{\partial t} = (1 - \alpha) \frac{k_i^{\beta}}{\sum_j k_j^{\beta}}. \qquad (1)$$

Given the probability density function $P_i(t_i)$ of the time $t_i$ at which individual $i$ first appears, the cumulative
probability $P(k_i(t) < k)$ follows, and the probability distribution for individual popularity and activity is
$P(k) = \partial P(k_i(t) < k) / \partial k$, which is a stretched exponential distribution. Its complementary
cumulative distribution function (CCDF) is $P_c(k) \sim \exp[-(k/k_0)^{c}]$, where $c = 1 - \beta$ and $k_0$ is a
constant. When $\beta \to 1$, the limit of $P(k)$ is a power law distribution. Its CCDF is $P_c(k) \sim k^{-\nu}$,
where $\nu = 1/(1 - \alpha)$. The special case of absent preference, $\beta = 0$, reduces Eq. (8) to an exponential
distribution. Generally the stretched exponential distribution is associated with sub-linear preference, and the power
law distribution with linear preference [14, 15].
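
The steps between the rate equation and the stated CCDFs can be sketched as follows. This is a reconstruction under two assumptions that are introduced here for illustration: the normalization grows linearly, $\sum_j k_j^{\beta} \approx \mu t$ with a constant $\mu$, and the entry times $t_i$ are uniformly distributed over $[0, t]$, as they are when new individuals arrive at a constant rate. The symbol $\mu$ is our notation.

```latex
% Assumption: \sum_j k_j^{\beta} \approx \mu t, and k_i(t_i) = 1.
\frac{\partial k_i}{\partial t} = \frac{(1-\alpha)\, k_i^{\beta}}{\mu t}
\;\Longrightarrow\;
k_i(t)^{1-\beta} = 1 + \frac{(1-\alpha)(1-\beta)}{\mu} \ln\frac{t}{t_i} .

% Assumption: t_i uniform on [0, t].
P_c(k) = P\big(k_i(t) \ge k\big)
       = P\!\left(t_i \le t \exp\!\left[-\frac{\mu\,(k^{1-\beta}-1)}{(1-\alpha)(1-\beta)}\right]\right)
       = \exp\!\left[-\frac{\mu\,(k^{1-\beta}-1)}{(1-\alpha)(1-\beta)}\right]
       \propto \exp\!\left[-\left(\frac{k}{k_0}\right)^{1-\beta}\right],
\qquad
k_0 = \left[\frac{(1-\alpha)(1-\beta)}{\mu}\right]^{1/(1-\beta)} .

% Limit \beta \to 1: \mu \to 1 because \sum_j k_j = t, and (k^{1-\beta}-1)/(1-\beta) \to \ln k.
\lim_{\beta \to 1} P_c(k) = \exp\!\left[-\frac{\ln k}{1-\alpha}\right] = k^{-1/(1-\alpha)} .

% Case \beta = 0: \mu = \alpha because \sum_j k_j^{0} = N(t) \approx \alpha t.
P_c(k)\big|_{\beta=0} = \exp\!\left[-\frac{\alpha\,(k-1)}{1-\alpha}\right] .
```

Under these assumptions the sketch reproduces the exponents quoted in the text, $c = 1 - \beta$ for the stretched exponential and $\nu = 1/(1-\alpha)$ for the power law, as well as the exponential form at $\beta = 0$.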



3. Results and discussion 

We ground our empirical analysis on actual log data extracted from an online media site, Comic
(http://comic.sjtu.edu.cn/music.asp) [5], and an online social network, Wealink (http://www.wealink.com) [6]. Note
that our approach is also applicable to other online social systems. Comic is located in a large Chinese university
with more than 40,000 undergraduate and graduate students, and is accessible only to IP addresses within the
university. We recorded its visiting log from October 25th, 2006 to February 6th, 2007 in the data format: time
$t_i$ / user ID number $u_i$ / music ID number $m_i$, i.e. a user $u_i$ listened to a song $m_i$ at time $t_i$. Users
were distinguished by their IP addresses. The total number of log records we obtained is 2,136,149, the number of
different pieces of music recorded is 98,747 (mostly popular songs), and the number of users recorded is 8472.

Figure 3: The CCDFs for users' activity (a) and music's popularity (b) in Comic. A stretched exponential distribution
appears as a straight line when $\ln P_c(k)$ is plotted against $k^{c}$. The solid lines represent the fitted lines.

Figure 4: The CCDFs of users' activity and popularity in Wealink. Both distributions have a power law tail.

Wealink is a large social networking site in China whose users are mostly professionals, typically businessmen and
office clerks. Each registered user has a profile, including his/her list of friends. For privacy reasons, the data, logged
from May 11th, 2005 (the inception day of the Internet community) to August 22nd, 2007, include only each user's
ID and list of friends, and the times of sending and accepting friendship invitations. The final data format is time
$t_i$ / user ID number $u_i$ / user ID number $v_i$ / flag $s_i$, where $s_i$ takes the value 0 or 1: $s_i = 0$
indicates that at time $t_i$ user $u_i$ invited user $v_i$ to be his/her friend, while $s_i = 1$ indicates that at
time $t_i$ user $u_i$ accepted user $v_i$'s invitation. During our data collection period there were 273,395 sent
invitations, of which more than 99.9% were accepted. The total number of users recorded is 223,482. As in most social
networking sites, in Wealink an inviter and a receiver become online friends only when the friendship invitation
is accepted. We measure users' activity and popularity in terms of their numbers of sent and received invitations.

In Comic, individual activity and popularity are well described by a stretched exponential distribution, as
shown in Fig. 3. In Wealink, by contrast, the distributions of users' activity and popularity have a power law tail
(Fig. 4).
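
A stretched exponential CCDF can be checked numerically as sketched below. This sketch uses an alternative linearization, $\ln(-\ln P_c(k)) = c \ln k - c \ln k_0$, which does not require knowing $c$ in advance; it is not necessarily the fitting procedure used for Fig. 3. The function names `empirical_ccdf` and `fit_stretched_exponent` are hypothetical.

```python
import math
from collections import Counter


def empirical_ccdf(values):
    """Return the distinct values k (ascending) and the fraction of samples >= k."""
    counts = Counter(values)
    total = len(values)
    remaining = total
    ks, ccdf = [], []
    for k in sorted(counts):
        ks.append(k)
        ccdf.append(remaining / total)
        remaining -= counts[k]
    return ks, ccdf


def fit_stretched_exponent(ks, ccdf):
    """Fit ln(-ln Pc(k)) = c*ln k - c*ln k0 by least squares and return (c, k0)."""
    # Points with Pc = 1 or Pc = 0 are skipped because the double logarithm is undefined there.
    points = [(math.log(k), math.log(-math.log(p)))
              for k, p in zip(ks, ccdf) if k > 0 and 0.0 < p < 1.0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    c = sxy / sxx
    k0 = math.exp(mean_x - mean_y / c)
    return c, k0


print(empirical_ccdf([1, 1, 2, 5]))  # ([1, 2, 5], [1.0, 0.5, 0.25])
```

In practice the activity or popularity counts of all individuals would be passed to `empirical_ccdf`, and the resulting points to `fit_stretched_exponent`; a clearly curved plot of $\ln(-\ln P_c)$ against $\ln k$ would indicate that a stretched exponential is not a good description.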

Figure 5: $\kappa$ versus $k$ for music (a) and users (b) in Comic.

Figure 6: Testing preferential selection for the users sending invitations and receiving invitations in Wealink.

We can compare the distributions of individual activity and popularity in real data with those predicted by the
stochastic model. Fig. 5 shows $\kappa$ versus $k$ for music and users in Comic. For these two cases, the sub-linear
preferential selection hypothesis offers a good approximation. The values of $\beta$ for users and music are
approximately 0.61 and 0.79, respectively. For the CCDF of users' activity, our model gives $c = 1 - \beta \approx 0.39$
and the empirical distribution in Fig. 3 gives $c \approx 0.45$, while for music's popularity, our model gives
$c \approx 0.21$ and the empirical distribution gives $c \approx 0.34$. Fig. 6 shows $\kappa$ versus $k$ for users in
Wealink. Approximately $\beta \approx 1$ for users' activity and popularity, indicating linear preference. The
appearance probabilities $\alpha$ of new users in the time-ordered lists of users sending and receiving invitations
are 0.53 and 0.35, respectively. For the CCDF of users' activity, the model gives $\nu = 1/(1 - \alpha) = 2.13$, while
for users' popularity the model gives $\nu = 1.54$. The power law exponents agree reasonably well with the empirical
results in Fig. 4.
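
The model predictions quoted above follow directly from $c = 1 - \beta$ and $\nu = 1/(1 - \alpha)$. The short Python check below reproduces the arithmetic; the function names are hypothetical and chosen only for this illustration.

```python
def stretched_exponent(beta):
    """Exponent c of the stretched exponential CCDF for sub-linear preference."""
    return 1.0 - beta


def power_law_exponent(alpha):
    """Exponent nu of the power law CCDF for linear preference."""
    return 1.0 / (1.0 - alpha)


# Comic: beta is about 0.61 for users and 0.79 for music.
print(round(stretched_exponent(0.61), 2), round(stretched_exponent(0.79), 2))  # 0.39 0.21
# Wealink: alpha is 0.53 for senders and 0.35 for receivers.
print(round(power_law_exponent(0.53), 2), round(power_law_exponent(0.35), 2))  # 2.13 1.54
```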

Fig. 7 shows the growth of the numbers of different users/music in Comic and users in Wealink with $T$.
The traditional assumption, as applied in the previous deduction, is that the generation probability of new individuals
is a constant value, i.e. $N(T) \propto T$. However, as shown in Fig. 7, this hypothesis is unrealistic to some extent.
For the Wealink users, the slopes for senders and receivers are approximately 1.09 and 0.97, respectively, whereas for
the users/music in Comic the growth lines show several segments with different slopes. In some cases the number
of distinct items introduced by users after $T$ assignments grows approximately as $N(T) \propto T^{\gamma}$ with
$\gamma < 1$. When dealing with the evolution of the number of attributes pertaining to some collection of objects,
this sub-linear growth is generally referred to as Heaps' law [16]. As an example, sub-linear behavior has been
observed in the growth of vocabulary size in texts, i.e. in the number of different words in a text as a function of
the total number of words observed while scanning through it. For English corpora, vocabulary growth exponents in the
range $0.4 < \gamma < 0.6$ have been reported [17].

Table 1: Probability distributions of popularity and activity for different patterns of growth and preference.

                         Linear growth                                    Sub-linear growth
Linear preference        Power law, $P_c(k) \sim k^{-\nu}$                $P(k) \sim k^{-1-\gamma}$
Sub-linear preference    Stretched exponential,                           Fat tail
                         $P_c(k) \sim \exp[-(k/k_0)^{1-\beta}]$

The rate at which new items appear at time $T$ scales as $dN(T)/dT \sim T^{\gamma - 1}$. That is, new items appear
less and less frequently, with the invention rate of new items decreasing monotonically towards zero. The approach to
zero is, however, so slow that the cumulative number of items does not converge asymptotically to a constant value but
is unbounded, assuming the observed trend stays valid.
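
The growth exponent $\gamma$ can be estimated from a time-ordered sequence as sketched below: count the distinct IDs $N(T)$ after each of the first $T$ entries and fit the slope of $\ln N(T)$ against $\ln T$. The function names are hypothetical, and a single least-squares fit over the whole range is an assumption made for brevity; curves with several segments, such as those reported for Comic, would need a piecewise fit.

```python
import math


def growth_curve(sequence):
    """N(T): number of distinct IDs among the first T entries, for T = 1..len(sequence)."""
    seen, curve = set(), []
    for entity in sequence:
        seen.add(entity)
        curve.append(len(seen))
    return curve


def growth_exponent(curve):
    """Least-squares slope of ln N(T) versus ln T, an estimate of gamma."""
    xs = [math.log(T) for T in range(1, len(curve) + 1)]
    ys = [math.log(n) for n in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic sequence: a new ID appears whenever T is a perfect square, so N(T) = floor(sqrt(T)).
sequence = []
next_id = 0
for T in range(1, 10001):
    if math.isqrt(T) ** 2 == T:
        next_id += 1
        sequence.append(next_id)
    else:
        sequence.append(1)  # repeat an old ID
gamma = growth_exponent(growth_curve(sequence))
print(gamma)  # expected to lie near 0.5 for this synthetic sequence
```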

Different users or items with distinct activity or popularity may have quite different $\gamma$. Recent research on the
collaborative tagging system del.icio.us reveals that, as the resources being bookmarked become less and less popular,
the distribution $P(\gamma)$ of the growth exponent of distinct tags gets broader and its peak shifts towards higher
values of $\gamma$, indicating that the growth behavior becomes more and more linear [3].

Table 1 summarizes the probability distributions of popularity and activity for the different patterns of growth and
preferential selection that can appear in real life. For sub-linear growth and linear preference, recent research
shows that when the rate at which new items appear satisfies $dN(T)/dT \ll 1$, the distribution can be approximately
viewed as a power law $P(k) \sim k^{-1-\gamma}$ [18]. For sub-linear growth and sub-linear preference, unfortunately,
the analysis of the probability distribution leads to a rather intractable relation whose analytical solution is hard
to obtain. Qualitatively, the distribution in this case is still a fat-tailed one intermediate between exponential and
power law. For a given sub-linear growth exponent, the distribution resulting from sub-linear preference is more
homogeneous than the power law resulting from linear preference; for a given sub-linear preference exponent, the
distribution resulting from sub-linear growth is more heterogeneous than the stretched exponential distribution
resulting from linear growth.
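
Because the case of sub-linear growth combined with sub-linear preference is hard to treat analytically, it can be explored numerically. The following Python sketch simulates the occurrence sequence under an assumed form for the probability of a new individual, $p_{new}(T) = \min(1, \alpha T^{\gamma - 1})$, which yields $N(T) \propto T^{\gamma}$ and reduces to the constant probability $\alpha$ for $\gamma = 1$. The function name `simulate`, this functional form and the default parameter values are assumptions made for illustration.

```python
import random


def simulate(steps, beta, gamma=1.0, alpha=0.5, seed=0):
    """Return the list of occurrence counts k_i after `steps` time steps.

    At step T a new individual appears with probability
    min(1, alpha * T**(gamma - 1)) (assumed form); otherwise an existing
    individual i is chosen with probability proportional to k_i**beta.
    """
    rng = random.Random(seed)
    counts = [1]  # the first step introduces one individual with one occurrence
    for T in range(2, steps + 1):
        p_new = min(1.0, alpha * T ** (gamma - 1.0))
        if rng.random() < p_new:
            counts.append(1)
        else:
            weights = [k ** beta for k in counts]
            i = rng.choices(range(len(counts)), weights=weights, k=1)[0]
            counts[i] += 1
    return counts


counts = simulate(steps=2000, beta=0.6, gamma=0.8)
print(len(counts), max(counts))  # number of individuals and the largest count
```

Each step costs time proportional to the current number of individuals, so this direct implementation is suitable only for short sequences. The resulting counts can be turned into an empirical CCDF and compared across values of $\beta$ and $\gamma$ to examine the qualitative ordering described above.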

The distributions of individual popularity and activity in many online social systems follow generic heavy-tailed
forms that are not necessarily power laws [19-24]. Several aspects of the underlying intricate dynamics may be
responsible for this feature. Besides the sub-linear preference discussed above, another possible origin is the memory
effect, that is, newly appeared individuals appear more frequently than old ones. For example, web users tend to listen
to recently added music or apply recently added tags more frequently than old ones, which may be equivalent to an
ageing effect of individuals. The popularity or activity of an entity inevitably undergoes a decaying process: users
become less active and items become less attractive over time [25-29].

Based on the growth and preference characteristics, it is possible to predict the amount of attention that will be
devoted over time to given items by measuring the data at an early time. However, this method does not consider the
semantics of popularity or why some items become more popular than others [30]. That is, popularity prediction in the
presence of a large table of item assignments can essentially be made from the observed early time series, while
semantic analysis of content may be more useful when no early click-through information is known. Semantic attraction
can lead to the initial prevalence of items, and subsequent preferential selection strengthens their popularity.